Abstract

The normalized maximum likelihood (NML) is a recent penalized likelihood that 
has properties that justify defining the amount of discrimination information (DI) in 
the data supporting an alternative hypothesis over a null hypothesis as the logarithm of 
an NML ratio, namely, the alternative hypothesis NML divided by the null hypothesis 
NML. The resulting DI, like the Bayes factor but unlike the p-value, measures the 
strength of evidence for an alternative hypothesis over a null hypothesis such that 
the probability of misleading evidence vanishes asymptotically under weak regularity 
conditions and such that evidence can support a simple null hypothesis. Unlike the
Bayes factor, the DI does not require a prior distribution and is minimax optimal in a
sense that does not involve averaging over outcomes that did not occur. Replacing a 
(possibly pseudo-) likelihood function with its weighted counterpart extends the scope 
of the DI to models for which the unweighted NML is undefined. The likelihood weights 
leverage side information, either in data associated with comparisons other than the 
comparison at hand or in the parameter value of a simple null hypothesis. Two case 
studies, one involving multiple populations and the other involving multiple biological 
features, indicate that the DI is robust to the type of side information used when that 
information is assigned the weight of a single observation. Such robustness suggests 
that very little adjustment for multiple comparisons is warranted if the sample size is 
at least moderate. 

Keywords: indirect information; information criteria; information for discrimination;
minimum description length; model selection; multiple comparison procedure; multiple testing;
normalized maximum likelihood; penalized likelihood; reduced likelihood; weighted likelihood
1 Introduction



1.1 Quantifying statistical evidence 

Many areas of science involve investigations of whether some effect is present and thus call 
for statistical methods that assess the evidence pertaining to whether a null hypothesis or 
an alternative hypothesis is closer to the system studied. For example, many experimental 
biologists are more interested in whether gene expression levels differ between control and 
treatment groups than in the effect size itself. 

Because not all samples are representative of their populations, the amount of evidence
against the null hypothesis is misleadingly high for some samples. Although the probability
of observing such an unrepresentative sample should decrease as the size of the sample
increases, that is not the case if proximity of a p-value to 0 is interpreted as the strength of
evidence against the null hypothesis. Indeed, the distribution of the p-value associated with
a simple (point) null hypothesis remains the same at all sample sizes if the null hypothesis
holds, making the p-value impossible to interpret as a level of evidence apart from considering
the sample size, as Royall (1997), Blume and Peipert (2003), and others have argued; cf.
Efron and Gous (2001) on the sample-size incoherence of significance testing. Bickel (2010c)
defined the lacking property by calling a measure of evidence interpretable if its probability
of misleading evidence vanishes asymptotically. That is, a measure of evidence satisfies
the interpretability condition only if the frequentist probability of observing a sample that
has misleading evidence exceeding some fixed threshold converges to 0 as the sample size
diverges.

Another adverse consequence of treating the p-value as a measure of evidence is its 
inability to indicate evidence in favor of a simple null hypothesis. In general, the amount 
of information in the data that favors a simple null hypothesis cannot be quantified by the 
p-value since it can only indicate whether there is evidence against it. 

The Bayes factor in principle overcomes the above limitations of the p-value but poses
the notorious problem of specifying the prior distribution of a nuisance parameter that is
not random in the frequentist sense. Any solution to the problem has practical implications
since the Bayes factor is sensitive to prior specification (Kass and Raftery, 1995).

Subjective prior distributions have the advantage of coherence and yet are rarely used in
data analysis since they depend on arbitrary choices in prior specification. On the other hand,
the improper prior distributions generated by conventional algorithms cannot be directly
applied to model selection since they would leave the Bayes factor undefined. That has
been overcome to some extent by dividing the data into training and test samples, with
the training samples generating proper priors for use with test samples, but at the expense
of requiring the specification of training samples and, in the presence of multiple training
samples, a method of averaging (Berger and Pericchi, 1996). Further, the interpretation of
the resulting posterior probability is not clear except perhaps as an approximation to an
agent's level of belief (Bernardo, 1997).



1.2 Repeated-sampling optimality 

Since there are many potential measures of evidence, most notably the Bayes factors defined 
by different priors, that satisfy the criteria that a measure of evidence be interpretable 
and that it can support a simple null hypothesis, an optimality criterion will be applied to 
determine a unique method of hypothesis testing and more general model selection. Before 
doing so, that criterion will be distinguished from standards of optimality in the received 
framework of statistics, that of Neyman and Pearson as generalized by Wald. 

The goal of minimizing risk, the expected loss with respect to a sampling distribution
(Wald, 1961), has provided a unified framework of estimation and testing and, as briefly
reviewed in Bickel (2010a, §3.1), has led to recent multiple comparison procedures. However,
Fraser and Reid (1990), Fraser (2004), Sprott (2004), and other frequentist statisticians
have criticized the framework for promoting opportunistic trade-offs between hypothetical
samples, thereby potentially misleading scientists and yielding unacceptably pathological
procedures. The main non-Bayesian alternative involves replacing the marginal sampling
distribution with a conditional sampling distribution given an exact or approximate ancillary
statistic (e.g., Sprott, 2000, §3.3).

While conditioning on an ancillary statistic makes the reference distribution more relevant
to inference on the basis of the observed sample (Fisher, 1973), it still does not permit
statements about the actual loss incurred. For example, the confidence level remains the
proportion of confidence intervals corresponding to repeated samples that cover the parameter
of interest. Although the use of exact confidence intervals minimizes a risk (Cornfield,
1969; Bickel, 2009), it is silent regarding the loss associated with the observed sample.



1.3 Observed-sample optimality 
1.3.1 Information-theoretic inference 

To address the issues outlined above, this paper continues the development of a new
information-theoretic alternative to previous approaches to statistical inference. The concept
of a predictive distribution will enable defining minimax optimality without repeated-sampling
or posterior-distribution averages. This approach is presented here largely without
the terminology of its origin in universal source coding (Shtarkov, 1987).



Consider the observed data vector $x \in \mathcal{X}^n$. Let $\mathcal{E}(\Omega)$ denote the set of all probability
density functions on any sample space $\Omega$, and let $\mathcal{F} = \{f_\phi : \phi \in \Phi\} \subset \mathcal{E}(\mathcal{X}^n)$ denote a
parametric family of density functions on $\mathcal{X}^n$ for parameter space $\Phi$. (Herein, the probability
densities are Radon-Nikodym derivatives, reducing to probability masses if $\mathcal{X}$ is countable.)
The maximum likelihood estimate of $\phi$, denoted by $\hat\phi(x)$, is assumed to be unique.

The regret of a predictive density $f \in \mathcal{E}(\mathcal{X}^n)$ is the logarithmic loss

\[ \mathrm{reg}(f, x; \Phi) = -\log f(x) - \inf_{\phi \in \Phi} \left( -\log f_\phi(x) \right) = \log \frac{f_{\hat\phi(x)}(x)}{f(x)} \tag{1} \]

for any $x \in \mathcal{X}^n$. The $\mathcal{E}(\mathcal{X}^n)$-optimal predictive density function relative to $\mathcal{F}$,

\[ f_0 = \arg \inf_{f \in \mathcal{E}(\mathcal{X}^n)} \sup_{u \in \mathcal{X}^n} \mathrm{reg}(f, u; \Phi), \tag{2} \]

while by definition in $\mathcal{E}(\mathcal{X}^n)$, is not necessarily in $\mathcal{F}$. Rather, $f_0$ is a probability density
function that represents the entire family $\mathcal{F}$ with a single distribution, much as does a prior
predictive density function. Instead of averaging the members of $\mathcal{F}$ with respect to a prior
distribution, the present definition employs $\mathcal{F}$ in equation (2) for each $u \in \mathcal{X}^n$ through the
maximization of the likelihood over $\phi \in \Phi$, as seen by substituting $u$ for $x$ in equation (1).

Originally motivated in the information theory literature by a need to minimize code
length (Shtarkov, 1987), equation (2) defines the type of minimax optimality employed as
opposed to the optimality of Section 1.2. (According to the minimum description length
principle, each family of distributions corresponds to an algorithm of most efficiently encoding
the information in $x$ (Rissanen, 2007; Grünwald, 2007; Rissanen, 2009).) The predictive
density function $f_0$ is optimal in that it solves the minimax problem involving all $u \in \mathcal{X}^n$, and
thus for the observed sample $x \in \mathcal{X}^n$, rather than the more usual minimax problem involving
an expectation value over all samples, as in the standard decision theory of frequentism. The
following result (Shtarkov, 1987; Rissanen, 2007; Grünwald, 2007), to be proved in Section
2.2.1 for a more general optimization problem, sheds light on the nature of the optimality
considered.

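Theorem 1. If $\int_{\mathcal{X}^n} f_{\hat\phi(u)}(u)\,du < \infty$, then the $\mathcal{E}(\mathcal{X}^n)$-optimal predictive density function
relative to $\mathcal{F}$ is

\[ f_0(\cdot) = f_0(\cdot; \Phi) = \frac{f_{\hat\phi(\cdot)}(\cdot)}{\int_{\mathcal{X}^n} f_{\hat\phi(u)}(u)\,du}. \tag{3} \]

Proof. This proof by contradiction is based on the direct proof given by Grünwald (2007,
§6.2.1). Assume, contrary to the claim, that the density function $f_0$ that satisfies equation (3)
is not the optimal predictive density function. Since, for any $v \in \mathcal{X}^n$, the ratio $f_0(v) / f_{\hat\phi(v)}(v)$
does not depend on $v$, it follows that, for any $f \in \mathcal{E}(\mathcal{X}^n) \setminus \{f_0\}$, there is a $v \in \mathcal{X}^n$ such that
$f(v) / f_{\hat\phi(v)}(v) < f_0(v) / f_{\hat\phi(v)}(v)$. Therefore, given any $f \in \mathcal{E}(\mathcal{X}^n) \setminus \{f_0\}$, there is a $v \in \mathcal{X}^n$
such that $\mathrm{reg}(f_0, v; \Phi) < \mathrm{reg}(f, v; \Phi)$. Because $\mathrm{reg}(f_0, \cdot\,; \Phi)$ is constant on $\mathcal{X}^n$, the worst-case
regret of every such $f$ exceeds that of $f_0$, which contradicts the assumption. □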

Note that $u$, the dummy variable of integration over $\mathcal{X}^n$, appears twice in the integrand.
For the observed $x \in \mathcal{X}^n$, the quantity $f_0(x) = f_0(x; \Phi)$ is called the normalized maximum
likelihood (NML) with respect to $\Phi$.

According to Theorem 1, the minimax optimality (2) of $f_0$ guarantees that $\mathrm{reg}(f_0, x; \Phi)$,
the regret due to the observed sample, cannot exceed $\sup_{u \in \mathcal{X}^n} \mathrm{reg}(f_0, u; \Phi)$, the regret due
to the worst-case sample. In that sense, $f_0$ is optimal for the observed sample. By contrast,
standard frequentist optimality, concerned only with loss averaged over all possible samples,
guarantees no bound on the loss inflicted by any individual sample.
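
As an illustration that is not part of the original development, the following minimal Python sketch computes the NML of equation (3) by brute force for a Bernoulli family with a handful of binary observations, in which case the integral over the sample space becomes a finite sum. The choice of family and all function names are assumptions made only for this example. The sketch also evaluates the regret at every possible sample, which should be the same everywhere and equal to the logarithm of the normalizing sum.

```python
import math
from itertools import product


def bernoulli_max_likelihood(seq):
    """Likelihood of a 0/1 sequence evaluated at its maximum likelihood estimate."""
    n, k = len(seq), sum(seq)
    p_hat = k / n
    # In Python, 0.0 ** 0 evaluates to 1.0, which covers k == 0 and k == n.
    return p_hat ** k * (1 - p_hat) ** (n - k)


def bernoulli_normalizer(n):
    """Sum of maximized likelihoods over all 2**n binary sequences (illustrative brute force)."""
    return sum(bernoulli_max_likelihood(u) for u in product((0, 1), repeat=n))


def bernoulli_nml(seq):
    """Normalized maximum likelihood of seq under the Bernoulli family."""
    return bernoulli_max_likelihood(seq) / bernoulli_normalizer(len(seq))


def regret(seq):
    """Regret (in nats) of the NML predictive distribution at seq."""
    return math.log(bernoulli_max_likelihood(seq) / bernoulli_nml(seq))


if __name__ == "__main__":
    n = 6
    regrets = [regret(u) for u in product((0, 1), repeat=n)]
    # The smallest and largest regret coincide up to rounding error.
    print(min(regrets), max(regrets), math.log(bernoulli_normalizer(n)))
```

Brute-force enumeration is used here only because it mirrors the definition directly; it is practical for small $n$ and is not meant as an efficient algorithm.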

Such observed-sample optimality justifies selecting the model or hypothesis corresponding
to the family $\mathcal{F}_i$ of distributions that minimizes $-\log f_0(x; \Phi_i)$, the observed prediction error
of the $i$th among a finite number of distribution families under consideration. Following
the terminology of Kullback (1968) and Bickel (2010b), $-\log f_0(x; \{\phi_0\}) - (-\log f_0(x; \Phi))$
would be the information in $x$ for discrimination in favor of the alternative hypothesis that
$\phi \neq \phi_0$ over the null hypothesis that $\phi = \phi_0$. Such information is an interpretable measure
of evidence under general conditions and can quantify the strength of any evidence in favor
of the null hypothesis as well as that of any evidence against it. More importantly, the
information for discrimination optimally quantifies the difference in how well each model or
hypothesis predicts relative to ideal predictors of individual samples rather than relative to
unknown true distributions, the ideal predictors in the sense of averages over samples. The
Kullback-Leibler risk, for example, only measures mean discrimination information relative
to unknown ideal predictors in the average sense.
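
To make the definition concrete, here is a small Python sketch of the information for discrimination, in bits, for a Bernoulli family with a simple null hypothesis. It is an illustration under stated assumptions rather than the paper's own computation: the family, the default null value `p0 = 0.5`, and the function names are all chosen for the example. For a simple hypothesis the NML is just the density at the null value, since no maximization takes place and the density already integrates to one.

```python
import math


def max_likelihood(k, n):
    """Bernoulli likelihood of any sequence with k successes in n trials, at the MLE."""
    p = k / n
    return p ** k * (1 - p) ** (n - k)  # 0.0 ** 0 == 1.0 handles k == 0 and k == n


def parametric_complexity_bits(n):
    """log2 of the NML normalizer, summing over sequences grouped by success count."""
    return math.log2(sum(math.comb(n, j) * max_likelihood(j, n) for j in range(n + 1)))


def discrimination_information_bits(k, n, p0=0.5):
    """Bits favoring the full Bernoulli family over the simple null p = p0."""
    log2_nml_alternative = math.log2(max_likelihood(k, n)) - parametric_complexity_bits(n)
    log2_null_density = k * math.log2(p0) + (n - k) * math.log2(1 - p0)
    return log2_nml_alternative - log2_null_density


if __name__ == "__main__":
    # Balanced data: negative information, i.e. evidence favoring the null.
    print(discrimination_information_bits(10, 20))
    # Extreme data: positive information, i.e. evidence favoring the alternative.
    print(discrimination_information_bits(20, 20))
```

When the data match the null exactly, the maximized likelihood equals the null density, so the information reduces to minus the parametric complexity; this is how the measure can register evidence for a simple null hypothesis.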

Since the base of the logarithm is inconsequential, it may be chosen for convenience
of interpretation. The binary logarithm ($\log_2$), yielding the number of bits of information,
enables not only immediate exponentiation back to the ratio domain but also the use of grades
of evidence that are both broad enough and refined enough for applications across scientific
disciplines (Table 1). Except for the distinction between negligible and weak evidence, the
grades closely mirror those Jeffreys (1948) originally proposed for the Bayes factor; cf. Bickel
(2010c). Accordingly, the [3, 5) grade of Table 1 is what Royall (1997, §1.12) considers "fairly
strong evidence" for one simple hypothesis over another, and the [5, 7) and [7, ∞) grades
together constitute his "quite strong evidence."

Information (bits)   (0,1)        [1,2)   [2,3)      [3,5)    [5,7)         [7,∞)
Evidence grade       Negligible   Weak    Moderate   Strong   Very strong   Overwhelming

Table 1: Heuristic grades of evidence for an alternative hypothesis over a null hypothesis
corresponding to intervals of the information for discrimination. The absolute value of a
negative amount of information gives the grade of evidence favoring the null hypothesis.
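
The grading scheme of Table 1 can be sketched as a simple lookup. The Python function below is only an illustration; its name, its return format, and the treatment of exactly zero bits (which the table's open interval (0,1) leaves unassigned) are assumptions made for the example.

```python
def evidence_grade(bits):
    """Map information for discrimination (in bits) to a Table 1 grade and favored hypothesis."""
    if bits > 0:
        favored = "alternative"
    elif bits < 0:
        favored = "null"
    else:
        favored = "neither"  # assumption: zero bits favors neither hypothesis
    magnitude = abs(bits)
    # Upper endpoints of the half-open intervals in Table 1.
    thresholds = [
        (1, "Negligible"),
        (2, "Weak"),
        (3, "Moderate"),
        (5, "Strong"),
        (7, "Very strong"),
    ]
    for upper, grade in thresholds:
        if magnitude < upper:
            return grade, favored
    return "Overwhelming", favored


if __name__ == "__main__":
    print(evidence_grade(4.2))
    print(evidence_grade(-1.5))
```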



1.3.2 Extension of information-theoretic inference 

Despite the unique observed-sample optimality of the NML for quantifying discrimination
information, three shortcomings make it impractical for use in many biostatistics applications.
First, since such applications typically partition $\phi$ into an interest parameter $\theta$ and
a nuisance parameter $\lambda$, the regret is relative to an ideal distribution determined by maximizing
the likelihood not only over $\theta$ but also over $\lambda$. As a result, the ideal member of the
family of distributions would be considered a better predictor than another member that has
the same value of $\theta$ on the basis of having a different value of $\lambda$, which should be irrelevant.
Thus, the NML is inadequate for testing hypotheses about $\theta$ in the presence of $\lambda$.

Second, the NML only uses information that is in $x$, but considering such information
about the parameter in isolation from other available information can be misleading unless
the sample size is sufficiently large. Additional information may be available in data from
other populations, from other biological features such as genes or SNPs, or from other
feature-feature comparisons. Even in the absence of such incidental information, there would be some
information in the fact that the null hypothesis that $\phi = \phi_0$ is seriously considered.

Third, the normalizing denominator of equation (3), the logarithm of which is called
the parametric complexity of $\mathcal{F}$, is infinite for typical families of distributions, including
the normal family. Each of the variant NMLs proposed to address the problem introduces
[...] (Lanterman, 2005; Grünwald, 2007). Rissanen (2007, §5.2.4), Rissanen and Roos (2007),
and Grünwald (2007, §11.4.2) proposed conditional
versions of the NML. Cf. related work by Takimoto and Warmuth (2000).

To overcome the first of the three identified problems with NML, it is generalized in
Section 2 by replacing the original data with a statistic that is a function of the data and
that has a distribution depending on $\theta$ but not on $\lambda$. Since the information in the data
relevant to the interest parameter is largely confined to the statistic, that information can be
better quantified in terms of the distribution of the statistic than in terms of the distribution
of the original data, the latter depending on the value of the nuisance parameter. In terms
of the minimum description length (MDL) metaphor (Rissanen, 2007; Grünwald, 2007), the
data are first compressed with little information loss by reduction to a smaller-dimensional
statistic and then further compressed by the family of distributions.

The use of a weighted likelihood addresses the second problem in the same section, which
also includes some results relevant to the probability of observing misleading information.
(The weighted likelihood was originally proposed for bias-variance trade-offs given relatively
small $n_i$ but potentially large $N$ (Feifang, 2002). More formally, Wang and Zidek (2005)
derived the weighted likelihood from the minimization of Kullback-Leibler loss.)

As a by-product for commonly used distribution families, that solution to the second
problem automatically solves the third problem, as illustrated in Section 3 with a
multiple-population data set and a multiple-feature data set. Finally, Section 4 concludes by highlighting
desirable properties of the new NML-based measure of information for discrimination.
2 Optimal inference

2.1 Preliminaries
2.1.1 Weighted likelihood 



The framework of Section 1.3 is generalized by the use of data reduction to eliminate a
nuisance parameter in $\phi$. Consider a measurable map $\tau_n : \mathcal{X}^n \to \mathcal{T}(n)$. Let $\theta : \Phi \to \Theta$
denote a subparameter function such that the probability density of $\tau_n(X)$ is $g_{\theta(\phi)}(\tau_n(X))$,
abbreviated as $g_\theta(\tau(X))$; the dependence of the density function $g_\theta$ on $n$ is suppressed. Thus,
the reduction of the data $X$ to a statistic $\tau(X)$ has the effect of replacing the full parameter
$\phi$ with the interest parameter $\theta$. Important special cases of $L(\theta; \tau(x)) = g_\theta(\tau(x))$ as a
function of $\theta$ are conditional likelihood functions and marginal likelihood functions (Royall,
1997; Severini, 2000; Bickel, 2010b).



The framework is now extended to $N$ hypotheses or comparisons. Let $\mathcal{G}_{n,i} = \{g_{n,i,\theta} : \theta \in \Theta\} \subset
\mathcal{E}(\mathcal{T}(n))$ denote the parametric family of density functions on $\mathcal{T}(n)$ for parameter space
$\Theta$. The "n" and "i" subscripts will be dropped when their values are clear. For the
$i$th of $N$ null hypotheses or comparisons, suppose $x_i \in \mathcal{X}^{n_i}$ is a realization of the random
vector $X_i$ of independent components. Then each $T_i = \tau(X_i)$ is distributed with
density $g_{\theta_i} = g_{n_i,i,\theta_i}$ and each outcome $t_i = \tau(x_i)$ is an element of $\mathcal{T}_i = \mathcal{T}(n_i)$. Let
$L_i(\theta; \tau(x_i)) = g_{n_i,i,\theta}(\tau(x_i)) = g_\theta(\tau(x_i))$, giving each comparison its own likelihood function.

Mapping $\mathcal{X}^{n_i}$ to $\mathcal{T}_i = \mathbb{R}^D$ is common in data reduction applications in which $\Theta = \mathbb{R}^D$.
Assigning a common parametric family to all comparisons ($\mathcal{G}_{n,i} = \mathcal{G}_{n,1}$ for all $i$) is usually
appropriate when each comparison corresponds to a biological feature, as in Section 3.2.



The observation $\mathbf{x} = (x_1, \ldots, x_N)$ generates the test statistic vector $\mathbf{t} = (t_1, \ldots, t_N) =
(\tau(x_1), \ldots, \tau(x_N))$, an outcome of $\mathbf{T} = (T_1, \ldots, T_N) = (\tau(X_1), \ldots, \tau(X_N))$. For inference
about $\theta_i$ on the basis of $\mathbf{t}$, the weighted likelihood function $\bar{L}_i(\cdot\,; \mathbf{t}) : \Theta \to [0, \infty)$ is defined
by

\[ \log \bar{L}_i(\theta_i; \mathbf{t}) = \sum_{j=1}^{N} w_{ij} \log L_j(\theta_i; t_j), \tag{4} \]

where the weights $\mathbf{w}_i = (w_{i1}, \ldots, w_{iN})$ are real numbers that may depend on $(n_1, \ldots, n_N)$ and
that satisfy $w_{ii} \geq w_{ij}$ (Feifang, 2002). The weights normally also conform to $\sum_{j=1}^{N} w_{ij} = 1$,
a requirement that will be temporarily relaxed in Section 2.2.1.
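
The weighted likelihood of equation (4) can be sketched in a few lines of Python. The example below is a minimal illustration under assumptions that are not in the text: every comparison follows a normal model with known unit variance, the statistic is the raw data vector, and the function names are hypothetical. Under these assumptions the maximizer of the weighted log-likelihood has a closed form, a weighted mean, which the sketch also computes.

```python
import math


def normal_loglik(theta, data, sigma=1.0):
    """Log-likelihood of mean theta for IID normal data with known sigma (assumed model)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - theta) ** 2 / (2 * sigma ** 2)
        for x in data
    )


def weighted_loglik(theta, datasets, weights):
    """Weighted sum of per-comparison log-likelihoods, one weight per comparison."""
    if len(datasets) != len(weights):
        raise ValueError("need exactly one weight per comparison")
    return sum(w * normal_loglik(theta, d) for w, d in zip(weights, datasets))


def weighted_mle(datasets, weights):
    """Closed-form maximizer of weighted_loglik for the unit-variance normal model."""
    numerator = sum(w * sum(d) for w, d in zip(weights, datasets))
    denominator = sum(w * len(d) for w, d in zip(weights, datasets))
    return numerator / denominator


if __name__ == "__main__":
    focus = [1.0, 2.0, 3.0]       # data of the comparison in focus
    incidental = [10.0, 10.0]     # data of another comparison
    weights = [0.75, 0.25]        # the focus weight is the largest and the weights sum to 1
    theta_hat = weighted_mle([focus, incidental], weights)
    print(theta_hat, weighted_loglik(theta_hat, [focus, incidental], weights))
```

Setting the focus weight to 1 and all other weights to 0 recovers the ordinary likelihood of the focus comparison, which is one way to sanity-check an implementation.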



Example 2. In most microarray studies, the expression levels of $N$ genes are measured with
the goal of determining which genes are differentially expressed between a treatment/perturbation
group of $m$ replicates and a control group of $n$ replicates; each of these biological replicates
represents one or more organisms. (Single-channel arrays do not require the pairing of replicates
between groups as did the dual-channel arrays.) Following the typical assumption that
intensity values are lognormally distributed, let $x_i = (x_{i1}, \ldots, x_{im})$ and $y_i = (y_{i1}, \ldots, y_{in})$
denote the logarithms of the $m$ and $n$ intensities of the $i$th gene in the perturbation and
control group, respectively. For small numbers of replicates, the assumption of a common
variance within each group is useful: $X_{ij} \sim \mathrm{N}(\xi_i, \sigma_i^2)$ and $Y_{ij} \sim \mathrm{N}(\eta_i, \sigma_i^2)$ with realized
values $X_{ij} = x_{ij}$ for $j = 1, \ldots, m$ and $Y_{ij} = y_{ij}$ for $j = 1, \ldots, n$. If $\theta_i$ is the absolute value of
the inverse coefficient of variation $|\xi_i - \eta_i| / \sigma_i$, then $t_i$ is conveniently taken as the absolute
value of the two-sample, equal-variance t-statistic, which has a noncentral t distribution with
noncentrality parameter $(m^{-1} + n^{-1})^{-1/2}\,\theta_i$ and $m + n - 2$ degrees of freedom.
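
The data reduction of Example 2 can be sketched as follows. This Python fragment is an illustration only: the function names and the toy log-intensities are hypothetical, and the noncentrality formula is the standard one for the equal-variance two-sample t-statistic, $(m^{-1} + n^{-1})^{-1/2}\theta_i$, taken here as an assumption about what the example intends.

```python
import math


def abs_two_sample_t(x, y):
    """Absolute value of the pooled-variance (equal-variance) two-sample t-statistic."""
    m, n = len(x), len(y)
    mean_x, mean_y = sum(x) / m, sum(y) / n
    # Pooled sum of squares has m + n - 2 degrees of freedom.
    pooled_ss = sum((v - mean_x) ** 2 for v in x) + sum((v - mean_y) ** 2 for v in y)
    pooled_var = pooled_ss / (m + n - 2)
    return abs(mean_x - mean_y) / math.sqrt(pooled_var * (1 / m + 1 / n))


def noncentrality(theta, m, n):
    """Noncentrality parameter for standardized mean difference theta and group sizes m, n."""
    return theta / math.sqrt(1 / m + 1 / n)


if __name__ == "__main__":
    treated = [1.0, 2.0, 3.0]   # hypothetical log-intensities, perturbation group
    control = [2.0, 3.0, 4.0]   # hypothetical log-intensities, control group
    print(abs_two_sample_t(treated, control))
    print(noncentrality(1.0, len(treated), len(control)))
```

Reducing each gene's $m + n$ measurements to a single statistic in this way leaves a distribution that depends on the standardized difference alone, which is the point of the reduction.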

The sampling distribution of $\mathbf{T}$ is denoted by $P$ to specify properties of the weights while
accommodating model misspecification, the case that there is not a $\theta_i \in \Theta$ such that $g_{\theta_i}$
is a density admitted by the marginal distribution $P(T_i \in \cdot)$ for all $i \in \{1, \ldots, N\}$. With
suitable weights and the assumption that $\hat\theta_i(\mathbf{T}) = \arg\sup_{\theta \in \Theta} \bar{L}_i(\theta; \mathbf{T})$ is almost surely unique
for all $i \in \{1, \ldots, N\}$, the difference between $\hat\theta_i(\mathbf{T})$ and the conventional maximum likelihood
estimator of $\theta_i$ almost surely converges to 0 as $n_i$ diverges with $N$ held fixed. Specifically,
$w_{ii} = 1 + o_P(1)$ and $i \neq j \implies w_{ij} = o_P(1)$ ensure that $\hat\theta_i(\mathbf{T}) = \arg\sup_{\theta \in \Theta} L_i(\theta; T_i) + o_P(1)$,
where the term $o_P(1)$ converges to 0 with $P$-probability 1 as $n_i \to \infty$ with any ratio $n_j / n_k$
bounded by constants: $n_j = O(n_k)$ for all $j, k \in \{1, \ldots, N\}$.



2.1.2 Predictive loss 

For some $g \in \mathcal{E}(\mathcal{T}_i)$, the generalized regret

\[ \mathrm{reg}_i(g, \mathbf{t}; \Theta) = -\log g(\tau(x_i)) - \inf_{\theta \in \Theta} \left( -\log \bar{L}_i(\theta; \mathbf{t}) \right) = \log \frac{\bar{L}_i(\hat\theta_i(\mathbf{t}); \mathbf{t})}{g(\tau(x_i))} \tag{5} \]

measures loss incurred by the likelihood associated with $g$, the predictive distribution, relative
to $\bar{L}_i(\hat\theta_i(\mathbf{t}); \mathbf{t})$, the maximum weighted likelihood of $\theta_i$. In other words, $\mathrm{reg}_i(g, \mathbf{t}; \Theta)$ is the
discrepancy between error in predicting the value of $\tau(x_i)$ on the basis of $g$ and the prediction
error minimized over the interest parameter. The latter error is more relevant to hypotheses
about the value of $\theta$ than a prediction error minimized over the full parameter $\phi$, including
the nuisance parameter $\lambda$ (§1.3). Thus, $\mathrm{reg}_i(g, \mathbf{t}; \Theta)$ replaces $\mathrm{reg}(f, x; \Phi)$ as the regret in
the presence of the nuisance parameter or a nonzero weight other than $w_{ii}$.

2.2 Optimal predictive distribution 
2.2.1 Exact predictive distribution 

For each $t \in \mathcal{T}_i$, let $\mathbf{t}_i(t)$ denote the $N$-tuple of statistics that is equal to $\mathbf{t}$ in all components
except the $i$th, which has $t$ in place of $t_i$. For example, $\mathbf{t}_1(t) = (t, t_2, \ldots, t_N)$, but $\mathbf{t}_i(t) =
(t_1, \ldots, t_{i-1}, t, t_{i+1}, \ldots, t_N)$ if $3 \leq i \leq N - 2$.

The optimal predictive density function of Section 1.3 is a special case of

\[ \bar{g}_i = \arg \inf_{g \in \mathcal{E}(\mathcal{T}_i)} \sup_{t \in \mathcal{T}_i} \mathrm{reg}_i(g, \mathbf{t}_i(t); \Theta), \]

the $\mathcal{E}(\mathcal{T}_i)$-optimal predictive density function relative to $(\mathcal{G}, \mathbf{w}_i)$.

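Theorem 3. Given some $i \in \{1, \ldots, N\}$ and $\mathbf{t} \in \mathcal{T}_1 \times \cdots \times \mathcal{T}_N$, if
$\int_{\mathcal{T}_i} \bar{L}_i(\hat\theta_i(\mathbf{t}_i(t)); \mathbf{t}_i(t))\,dt < \infty$,
then, for all $t_i \in \mathcal{T}_i$, the $\mathcal{E}(\mathcal{T}_i)$-optimal predictive density function relative to $(\mathcal{G}, \mathbf{w}_i)$
satisfies

\[ \bar{g}_i(t_i) = \frac{\bar{L}_i(\hat\theta_i(\mathbf{t}); \mathbf{t})}{\int_{\mathcal{T}_i} \bar{L}_i(\hat\theta_i(\mathbf{t}_i(t)); \mathbf{t}_i(t))\,dt}. \tag{6} \]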



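Proof. The present argument follows that used to prove Theorem 1. Assume, contrary to
the claim, that the density function $\bar{g}_i$ that satisfies equation (6) for all $t_i \in \mathcal{T}_i$ is not the
optimal predictive density function relative to $(\mathcal{G}, \mathbf{w}_i)$. The substitution $\bar{L}_i(\hat\theta_i(\mathbf{t}); \mathbf{t}) =
\bar{L}_i(\hat\theta_i(\mathbf{t}_i(t_i)); \mathbf{t}_i(t_i))$ demonstrates that the ratio $\bar{g}_i(t_i) / \bar{L}_i(\hat\theta_i(\mathbf{t}_i(t_i)); \mathbf{t}_i(t_i))$ does not depend
on $t_i$. It follows that, for any $g_i \in \mathcal{E}(\mathcal{T}_i) \setminus \{\bar{g}_i\}$, there is a $t_i \in \mathcal{T}_i$ such that
$g_i(t_i) / \bar{L}_i(\hat\theta_i(\mathbf{t}_i(t_i)); \mathbf{t}_i(t_i)) < \bar{g}_i(t_i) / \bar{L}_i(\hat\theta_i(\mathbf{t}_i(t_i)); \mathbf{t}_i(t_i))$.
Therefore, given any $g_i \in \mathcal{E}(\mathcal{T}_i) \setminus \{\bar{g}_i\}$, there is a $t_i \in \mathcal{T}_i$ such
that $\mathrm{reg}_i(\bar{g}_i, \mathbf{t}_i(t_i); \Theta) < \mathrm{reg}_i(g_i, \mathbf{t}_i(t_i); \Theta)$, which contradicts the assumption. □

For any $x_i \in \mathcal{X}^{n_i}$, the quantity $\bar{g}_i(\tau(x_i)) = \bar{g}_i(\tau(x_i); \Theta)$ is the normalized maximum
weighted likelihood (NMWL) with respect to $\Theta$ or, more precisely, with respect to $(\mathcal{G}, \mathbf{w}_i)$.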

Example 4. When the constraint that $\sum_{j=1}^{N} w_{ij} = 1$ is relaxed, NMWL generalizes various
previous NMLs as follows. If $\tau(x_1) = x_1$ for some $x_1 \in \mathcal{X}^{n_1}$ and if $N = 1$, the NMWL reduces
to the probability density $\bar{g}_1(x_1)$ with $w_{1,1} = 1$ and thus to $f_0(x)$, the NML of equation (3).
For an observed vector $(y_1, \ldots, y_{n_1}) \in \mathcal{X}^{n_1}$, assigning $N = 2$, $\theta_1 = \theta_2$, $t_1 = (y_1, \ldots, y_{n_1 - 1})$,
and $t_2 = y_{n_1}$ demonstrates that the prominent conditional NMLs are NMWLs in the case of
IID data. In particular, Grünwald (2007, §11.4.2) considered $\bar{g}_1((t_1, t_2))$ with $w_{1,1} = w_{1,2} = 1$.
Conversely, Rissanen (2007, §5.2.4) and Rissanen and Roos (2007) studied $\bar{g}_2((t_1, t_2))$ with
$w_{2,1} = w_{2,2} = 1$, thereby facilitating computation of the normalizing constant in equation
(6) since the integration is only over a scalar. The main drawback of applying conditional
NMLs to the IID setting is the arbitrary nature of choosing an observation $x_2$ to leave out
since the observations are not ordered in time (Grünwald, 2007, §11.4.3). The same issue
arises in Bayesian model selection when an improper prior is conditioned on a minimal
training sample before computing the Bayes factor. A popular solution is to take geometric
or arithmetic averages over all possible minimal training samples (Berger and Pericchi, 2004).
Analogous approaches to IID applications of conditional NMLs would likewise depend on
arbitrary choices of averages and of training sample sizes (§1.1).



2.2.2 Approximate predictive distribution 

A computationally efficient approximation to the NMWL is available if: 

1. The weight of any comparison in focus is equal to that of any other comparison when
it is in focus, i.e., $w_{ii} = w_{1,1}$ for all $i$.

2. The weight of each comparison not in focus is equal to that of any other comparison
not in focus, i.e., $w_{ij} = w_{1,2}$ for all $i \neq j$.

3. The sample sizes and sample spaces are equal, i.e., $n_i = n_1$ and $\mathcal{T}_i = \mathcal{T}_1$ for all $i$.

4. All comparisons share a single family, i.e., $\mathcal{G}_{n_i,i} = \mathcal{G}_{n_1,1}$ and $L_i = L_1$ for all $i$.

Under those equal weight conditions, there is an approximate weight $w_{N+1}$ such that $w_{N+1} =
w_{ii}$ for all $i$ and an approximate weight $w_j$ such that $w_j = (N - 1) N^{-1} w_{ij}$ for all $i$ and $j$
except $i = j$. Then $\sum_{j=1}^{N+1} w_j = N w_1 + w_{N+1} = (N - 1)\,w_{1,2} + w_{1,1} = 1$.
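
The bookkeeping behind the approximate weights can be checked numerically. The short Python sketch below assumes the equal weight conditions, with a common focus weight `w_focus` and a common non-focus weight `w_other`; the function name and the example values are hypothetical.

```python
def approximate_weights(w_focus, w_other, n_comparisons):
    """Return the N + 1 approximate weights (w_1, ..., w_N, w_{N+1}).

    w_focus is the common value of w_ii and w_other the common value of w_ij for i != j.
    """
    if n_comparisons < 2:
        raise ValueError("at least two comparisons are needed")
    shared = (n_comparisons - 1) / n_comparisons * w_other  # w_j for j = 1, ..., N
    return [shared] * n_comparisons + [w_focus]              # last entry is w_{N+1}


if __name__ == "__main__":
    N = 5
    w_focus = 0.9
    w_other = (1 - w_focus) / (N - 1)  # exact weights for one comparison sum to 1
    weights = approximate_weights(w_focus, w_other, N)
    print(weights, sum(weights))
```

Because the first $N$ approximate weights do not depend on which comparison is in focus, the same weighted sum over all comparisons can be reused for every $i$, which is the source of the computational saving.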

For any $t \in \mathcal{T}_1$, let $\mathbf{t}(t)$ denote $(t_1, \ldots, t_N, t) \in \mathcal{T}_1^{N+1}$. For inference about $\theta_i$ on the basis
of $\mathbf{t}$, the approximate weighted likelihood function $\tilde{L}(\cdot\,; \mathbf{t}(t)) : \Theta \to [0, \infty)$ is defined by

\[ \log \tilde{L}(\theta_i; \mathbf{t}(t)) = \sum_{j=1}^{N} w_j \log L_j(\theta_i; t_j) + w_{N+1} \log L_1(\theta_i; t)
 = \frac{1 - w_{N+1}}{N} \sum_{j=1}^{N} \log L_1(\theta_i; t_j) + w_{1,1} \log L_1(\theta_i; t), \]

the second equality implied by the equal weight conditions. Let $\tilde\theta(\mathbf{t}(t)) = \arg\sup_{\theta \in \Theta} \tilde{L}(\theta; \mathbf{t}(t))$.
The following theorem indicates that the exact NMWL $\bar{g}_i(t_i)$ is approximated by

\[ \tilde{g}_i(t_i) = \frac{\tilde{L}(\tilde\theta(\mathbf{t}(t_i)); \mathbf{t}(t_i))}{\int_{\mathcal{T}_1} \tilde{L}(\tilde\theta(\mathbf{t}(t)); \mathbf{t}(t))\,dt}, \]

which may be quickly calculated even for large $N$ since the denominator, not depending
on $i$, need only be computed once. In the theorem and its supporting lemmas, $\mathbf{T}(t) =
(T_1, \ldots, T_N, t)$, and $\xrightarrow{a.s.}$ denotes almost sure convergence as $N$ increases with $n_1$ fixed.

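Lemma 5. If the equal weight conditions hold and if $T_1, \ldots, T_N$ are drawn independently
from a mixture distribution, then, for all $\theta \in \Theta$ and $t \in \mathcal{T}_1$,

\[ \log \tilde{L}(\theta; \mathbf{T}(t)) - \log \bar{L}_i(\theta; \mathbf{T}_i(t)) \xrightarrow{a.s.} 0. \tag{7} \]

Proof. According to the equal weight conditions,

\[ \log \tilde{L}(\theta; \mathbf{T}(t)) - \log \bar{L}_i(\theta; \mathbf{T}_i(t))
 = \left( \frac{1 - w_{N+1}}{N} \sum_{j=1}^{N} \log L_1(\theta; T_j) + w_{N+1} \log L_1(\theta; t) \right)
 - \left( w_{1,2} \sum_{j \neq i;\, j=1}^{N} \log L_1(\theta; T_j) + w_{1,1} \log L_1(\theta; t) \right) \]
\[ = (1 - w_{N+1}) \left( \frac{1}{N} \sum_{j=1}^{N} \log L_1(\theta; T_j) - \frac{1}{N - 1} \sum_{j \neq i;\, j=1}^{N} \log L_1(\theta; T_j) \right). \]

The second factor almost surely vanishes by the law of large numbers. □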

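Lemma 6. Under the assumptions of Lemma 5, the stipulations that $\tilde\theta(\mathbf{T}(t))$ and $\hat\theta_i(\mathbf{T}_i(t))$
are almost always unique for all $t \in \mathcal{T}_1$ and that $L_i(\cdot\,; T_i)$ is almost surely continuous on $\Theta$
for all $i = 1, \ldots, N$ imply that, for all $i = 1, \ldots, N$ and $t \in \mathcal{T}_1$, $\tilde\theta(\mathbf{T}(t)) - \hat\theta_i(\mathbf{T}_i(t)) \xrightarrow{a.s.} 0$.

Proof. By Lemma 5, equation (7) holds for all $\theta \in \Theta$. Thus, since almost sure convergence
is preserved under almost surely continuous transformations (Serfling, 1980, §1.7),
$\arg\sup_{\theta \in \Theta} \tilde{L}(\theta; \mathbf{T}(t)) - \arg\sup_{\theta \in \Theta} \bar{L}_i(\theta; \mathbf{T}_i(t)) \xrightarrow{a.s.} 0$. □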

Theorem 7. Under the assumptions of Lemma 6, the difference between the approximate
and exact parametric complexities almost surely vanishes:

\[ \int_{\mathcal{T}_1} \tilde{L}(\tilde\theta(\mathbf{T}(t)); \mathbf{T}(t))\,dt - \int_{\mathcal{T}_i} \bar{L}_i(\hat\theta_i(\mathbf{T}_i(t)); \mathbf{T}_i(t))\,dt \xrightarrow{a.s.} 0. \tag{8} \]

Proof. Combining the results of Lemmas 5 and 6 gives

\[ \tilde{L}(\tilde\theta(\mathbf{T}(t)); \mathbf{T}(t)) - \bar{L}_i(\hat\theta_i(\mathbf{T}_i(t)); \mathbf{T}_i(t)) \xrightarrow{a.s.} 0 \]

for all $t \in \mathcal{T}_1$ since $\mathcal{T}_i = \mathcal{T}_1$ and the functions are almost surely continuous by assumption. □



2.3 Optimal discrimination information 

For any $\Theta' \subseteq \Theta$, let $\bar{g}_i(t_i; \Theta')$ denote the optimal predictive density function relative to
$(\{g_\theta : \theta \in \Theta'\}, \mathbf{w}_i)$ as defined in Section 2.2.1. For any $\Theta_0, \Theta_1 \subseteq \Theta$, the optimal information
in $\mathbf{x}$ for discrimination in favor of the hypothesis that $\theta_i \in \Theta_1$ over the hypothesis that
$\theta_i \in \Theta_0$ is

\[ \bar{I}_i(\Theta_1, \Theta_0) = -\log \bar{g}_i(t_i; \Theta_0) - \left( -\log \bar{g}_i(t_i; \Theta_1) \right); \]

cf. Kullback (1968), Rissanen (1987), Bickel (2010c), and Bickel
(2010b). The approximate optimal information $\tilde{I}_i(\Theta_1, \Theta_0)$ is defined identically except
with $\tilde{g}_i$ in place of $\bar{g}_i$. $\bar{I}_i(\Theta_1, \Theta_0)$ is not restricted to the case of smoothness conditions
on $\{g_\theta : \theta \in \Theta\}$, but applies to any problem of selecting one of two models.

Since $\bar{g}_i(t_i; \Theta_1) / \bar{g}_i(t_i; \{\theta_0\})$ for $\theta_0 \in \Theta$ is a likelihood ratio, the discrimination information
has the universal bound on the probability of misleading evidence under $\theta = \theta_0$ (Royall, 2000;
Bickel, 2010b). The next lemma and theorem bear on whether the optimal information for
discrimination is an interpretable measure of evidence in that the probability of observing
misleading information converges to 0 as $n_i \to \infty$. Let $\hat\theta_i(\mathbf{T}; \Theta') = \arg\sup_{\theta \in \Theta'} \bar{L}_i(\theta; \mathbf{T})$ for
$i = 1, \ldots, N$ and $\hat\theta_0(t; \Theta') = \arg\sup_{\theta \in \Theta'} L_1(\theta; t)$ given any $\Theta' \subseteq \Theta$.

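Lemma 8. Suppose $\Theta = \mathbb{R}^D$, $\theta = (\theta_1, \ldots, \theta_D)^{\mathrm{T}}$, $n_j = O(n_k)$ for all $j, k \in \{1, \ldots, N\}$,
$\mathcal{G}_{n,i} = \mathcal{G}_{n,1}$ and $L_i = L_1$ for all $i \in \{1, \ldots, N\}$ and sufficiently large $n$, and $\tau(x) = x$ for
all $x \in \mathcal{X}^n$, which implies that $\mathcal{T}_i = \mathcal{X}^{n_i}$, $t_i = x_i$, and $T_i = X_i$. Assume also that for some
$i \in \{1, \ldots, N\}$, there exists an open, bounded set $\Theta' \subset \Theta$ on which $L_1(\cdot\,; X_i)$ is almost surely
continuous and such that

\[ \log \int_{\mathcal{T}_i} L_1(\hat\theta_0(t; \Theta'); t)\,dt = \frac{D}{2} \log \frac{n_i}{2\pi} + \log \int_{\Theta'} \sqrt{\left| \det \mathrm{E}\left[ \frac{\partial^2 \ln g_\theta(X_{i1})}{\partial\theta\,\partial\theta^{\mathrm{T}}} \right] \right|}\,d\theta + o(1). \tag{9} \]

Then

\[ \log \int_{\mathcal{T}_i} \bar{L}_i(\hat\theta_i(\mathbf{T}_i(t); \Theta'); \mathbf{T}_i(t))\,dt = \frac{D}{2} \log \frac{n_i}{2\pi} + \log \int_{\Theta'} \sqrt{\left| \det \mathrm{E}\left[ \frac{\partial^2 \ln g_\theta(X_{i1})}{\partial\theta\,\partial\theta^{\mathrm{T}}} \right] \right|}\,d\theta + o(1) \]

almost surely holds for any weights that satisfy $P(\lim_{n_i \to \infty} w_{ii} = 1) = 1$ and $\sum_{j=1}^{N} w_{ij} = 1$.

Proof. The continuity condition and the constraints on the weights and sample sizes ensure
that $\bar{L}_i(\hat\theta_i(\mathbf{T}_i(t); \Theta'); \mathbf{T}_i(t)) \xrightarrow{a.s.} L_1(\hat\theta_0(t; \Theta'); t)$ as $n_i \to \infty$ for all $t \in \mathcal{T}_i$. □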



The assumptions of Lemma 8 are broadly applicable since equation (9) holds under
general regularity conditions (Rissanen, 1996). The result will now be extended to
non-bounded parameter spaces.

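Theorem 9. Suppose that $\Theta_1 \subseteq \Theta$, that $n_j = O(n_k)$ for all $j, k \in \{1, \ldots, N\}$, and that $P$ is the sampling distribution of $T$. Assume also that for any $i \in \{1, \ldots, N\}$, there exists an open, bounded set $\Theta' \subset \Theta_1$ such that

$$P\left( \lim_{n_i \to \infty} \log \int_{\mathcal{T}_i} \bar{L}_i\left(\hat{\theta}_i(T_i(t); \Theta'); T_i(t)\right) dt = \infty \right) = 1. \tag{10}$$

Then

$$P\left( \lim_{n_i \to \infty} \log \int_{\mathcal{T}_i} \bar{L}_i\left(\hat{\theta}_i(T_i(t); \Theta_1); T_i(t)\right) dt = \infty \right) = 1.$$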

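Proof. Let $\mathcal{T}' = \left\{ t \in \mathcal{T}_i : \hat{\theta}_i(T_i(t); \Theta_1) \in \Theta' \right\}$ to expand $\int_{\mathcal{T}_i} \bar{L}_i\left(\hat{\theta}_i(T_i(t); \Theta_1); T_i(t)\right) dt$ as

$$\int_{\mathcal{T}'} \bar{L}_i\left(\hat{\theta}_i(T_i(t); \Theta_1); T_i(t)\right) dt + \int_{\mathcal{T}_i \setminus \mathcal{T}'} \bar{L}_i\left(\hat{\theta}_i(T_i(t); \Theta_1); T_i(t)\right) dt.$$

Thus, since $\bar{L}_i\left(\hat{\theta}_i(T_i(t); \Theta_1); T_i(t)\right) = \bar{L}_i\left(\hat{\theta}_i(T_i(t); \Theta'); T_i(t)\right)$ for all $t \in \mathcal{T}'$ and

$$\bar{L}_i\left(\hat{\theta}_i(T_i(t); \Theta_1); T_i(t)\right) > \bar{L}_i\left(\hat{\theta}_i(T_i(t); \Theta'); T_i(t)\right)$$

for all $t$ in non-empty $\mathcal{T}_i \setminus \mathcal{T}'$ given any sufficiently large $n_i$,

$$\int_{\mathcal{T}_i} \bar{L}_i\left(\hat{\theta}_i(T_i(t); \Theta_1); T_i(t)\right) dt > \int_{\mathcal{T}_i} \bar{L}_i\left(\hat{\theta}_i(T_i(t); \Theta'); T_i(t)\right) dt$$

follows, where the equality and both inequalities hold with $P$-probability 1. □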



Since the claim of Lemma 8 implies equation (10), Theorem 9 applies to the wide class of models satisfying the regularity conditions of Rissanen (1996). The largely overlapping regularity conditions of Sin and White (1996) then ensure that $\lim_{n_i \to \infty} P\left(I_i(\Theta_1, \{\theta_0\}) > 0\right) = 0$ when there is no $\theta \in \Theta_1$ such that $g_\theta$ is closer in Kullback-Leibler divergence than $g_{\theta_0}$ to the marginal distribution $P(T_i \in \cdot)$. In the special case of correct model specification considered in Royall (2000) and Bickel (2010b), the equation holds for all $P$ admitting $g_{\theta_0}$ as the marginal density of $T_i$.



2.4 Single-observation weights 

This section defines single-observation weights as the components of $w_i$ such that, for every $i \in \{1, \ldots, N\}$, all incidental data (all $x_j$ with $j \ne i$) together have the weight of one observation in the focus vector $x_i$ ($\sum_{j \ne i} w_{ij} = w_{ii}/n_i$) and each comparison other than the $i$th has equal weight ($w_{ij} = w_{ik}$ for all $j, k \ne i$). Solving those equations and $\sum_{j=1}^{N} w_{ij} = 1$ uniquely gives $w_{ii} = 1 - (n_i + 1)^{-1}$ and $i \ne j \Rightarrow w_{ij} = (n_i + 1)^{-1} (N - 1)^{-1}$.
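
The closed-form solution can be verified directly against its three defining constraints. The following is a minimal sketch; the function name and the zero-based index are illustrative choices rather than notation from the text, and exact rational arithmetic is used so that the constraints hold without rounding error.

```python
from fractions import Fraction


def single_observation_weights(i, sample_sizes):
    """Return [w_i1, ..., w_iN] for the focus comparison with zero-based index i.

    sample_sizes[j] is the sample size n_j of comparison j; only n_i and
    the number of comparisons N enter the single-observation weights.
    """
    n_comparisons = len(sample_sizes)
    if n_comparisons < 2:
        raise ValueError("at least one incidental comparison is required")
    n_i = sample_sizes[i]
    focus = 1 - Fraction(1, n_i + 1)                          # w_ii
    incidental = Fraction(1, (n_i + 1) * (n_comparisons - 1))  # w_ij, j != i
    return [focus if j == i else incidental for j in range(n_comparisons)]


if __name__ == "__main__":
    weights = single_observation_weights(0, [8, 8, 8])
    print(weights)                                  # focus weight 8/9, incidental weights 1/18 each
    print(sum(weights) == 1)                        # the weights sum to 1
    print(sum(weights[1:]) == weights[0] / 8)       # incidental data weigh one observation
```

With $n_i = 8$ and $N = 3$, the incidental comparisons jointly receive weight $1/9$, which is one eighth of the focus weight $8/9$, as the first constraint requires.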

If there is only a single comparison, then its observed statistic $t_1 = \tau(x_1)$ is supplemented by a pseudo-statistic $t_0$, a scientifically meaningful value in $\mathcal{T}_1$ that does not depend on $x_1$. For example, $t_0$ might be $\int t \, g_{\theta_0}(t) \, dt$, the expectation value of $T_1$ under $\theta = \theta_0$. (Similarly, Kass and Wasserman (1995) considered the use of a prior with the Fisher information of a single pseudo-observation.) The use of $t = (t_0, t_1)$ and $N = 2$ with single-observation weights then entails that

$$\log \bar{L}_1(\theta_1; t) = (n_1 + 1)^{-1} \log L_1(\theta_1; t_0) + \left(1 - (n_1 + 1)^{-1}\right) \log L_1(\theta_1; t_1). \tag{11}$$
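
Equation (11) can be made concrete as follows. The sketch assumes, for illustration only, that $L_1$ is a normal likelihood with known standard error and that $t_0$ and $t_1$ enter with the same standard error; the function names are hypothetical. Under these assumptions the maximizer of equation (11) is the weighted average $(n_1 + 1)^{-1} t_0 + \left(1 - (n_1 + 1)^{-1}\right) t_1$, which shrinks the observed statistic toward the pseudo-statistic by the weight of one observation.

```python
import math


def normal_log_lik(theta, t, se):
    """Log-likelihood of mean theta for one normal statistic t with known standard error."""
    return -0.5 * math.log(2.0 * math.pi * se ** 2) - 0.5 * ((t - theta) / se) ** 2


def weighted_log_lik(theta, t0, t1, n1, se):
    """Equation (11) under the illustrative normal assumption.

    t0 is the pseudo-statistic, t1 the observed statistic, and n1 the
    sample size of the single comparison.
    """
    w0 = 1.0 / (n1 + 1)
    return w0 * normal_log_lik(theta, t0, se) + (1.0 - w0) * normal_log_lik(theta, t1, se)


def weighted_mle(t0, t1, n1):
    """Maximizer of weighted_log_lik in theta when both terms share one standard error."""
    w0 = 1.0 / (n1 + 1)
    return w0 * t0 + (1.0 - w0) * t1


if __name__ == "__main__":
    # Null-based pseudo-statistic t0 = 0, observed statistic t1 = 2.5, n1 = 4.
    print(weighted_mle(0.0, 2.5, 4))
```

As $n_1$ grows, the weight on $t_0$ vanishes and the weighted estimate approaches the ordinary maximum likelihood estimate $t_1$.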






For a smoother transition from a single comparison to multiple comparisons, the pseudo-statistic may be assigned the same weight as each of the $N - 1$ incidental statistics among $t_1, \ldots, t_N$, i.e., $w_{ij} = (n_i + 1)^{-1} N^{-1}$ for all $j \in \{0, 1, \ldots, N\} \setminus \{i\}$.

The following result applies whether there is a single comparison or multiple comparisons. 

Corollary 10. Assume the components of $w_i$ are single-observation weights, that $n_i = n_1$ for all $i = 1, \ldots, N$, and that $Q_{n_i,i} = Q_{n_1,1}$ for all $i = 1, \ldots, N$. If $T_1, \ldots, T_N$ are independent and drawn from a mixture distribution, if $\hat{\theta}$ and $\hat{\theta}_i(T_i(t))$ are almost always unique for all $t \in \mathcal{T}_i$, and if $L_i(\cdot; T_i)$ is almost surely continuous on $\Theta$ for all $i = 1, \ldots, N$, then equation (8) holds.

Proof. All the conditions of Theorem 7 are given except for the equal weights condition, which follows from the single-observation weights assumption, the equality of the sample sizes, and the commonality of the family of distributions. □



3 Case studies 

In the following models, $\int f_{\hat{\theta}(x)}(x) \, dx = \infty$, rendering the unweighted NML useless. The NMWL can be used instead since $\int_{\mathcal{T}_i} \bar{L}_i\left(\hat{\theta}_i(T_i(t)); T_i(t)\right) dt < \infty$.

Results of two separate NMWL analyses are presented for each application. The first uses multiple comparisons for inference relevant to each comparison. The second uses $\int t \, g_{\theta_0}(t) \, dt = 0$ in place of data associated with other comparisons, as if there were only a single comparison (11). All plots use the binary logarithm to express information in bits and display a different value for each comparison.



3.1 Single and multiple populations 

Before addressing a problem in contemporary biology, the proposed methodology will be illustrated using a simple data set that has motivated both Bayesian (Rubin, 1981) and weighted likelihood (Wang, 2006) approaches. The reduced data consist of the estimated average effect of a training program on SAT scores and an estimated standard error of the effect estimate for each of eight test sites. Following the tradition continued by Wang (2006), the standard errors $\sigma_1, \ldots, \sigma_8$ are considered known, and the effect estimates are modeled as normal observations with unknown means $\theta_1, \ldots, \theta_8$. Thus, $N = 8$ and $\{g_{i,\theta} : \theta \in \Theta\}$ is the family of distributions, where $g_{i,\theta}$ is the normal density of mean $\theta$ and standard deviation $\sigma_i$. For the $i$th site, $\theta_i \ne 0$ is the alternative hypothesis and $\theta_i = 0$ is the null hypothesis. Fig. 1 displays $\tilde{I}_i(\mathbb{R} \setminus \{0\}, \{0\})$, the resulting approximate discrimination information, with $I_i(\mathbb{R} \setminus \{0\}, \{0\})$, the exact discrimination information. As in Section 2.4, the weight of a single observation is assigned either to 0, the null hypothesis value ("information from null"), or to the incidental testing sites ("information from sites"). The resulting information values are barely distinguishable.

Figure 1: Information (bits) favoring the hypothesis that the test score at a site was affected by the treatment.



3.2 Single and multiple biological features 

In typical experiments measuring gene expression or the abundance of proteins or metabolites, the primary question is whether the expectation value of a logarithm of the expression or abundance of each feature is affected by a treatment, disease, or other perturbation. Since that question is equivalent to that of whether $CV_i^{-1}$, the inverse coefficient of variation for the $i$th feature, is 0, the data reduction strategy of Example 2 often proves effective even if the magnitude of $CV_i^{-1}$ is not of direct interest. $CV_i^{-1}$ has a one-to-one correspondence to the proportion of the feature-feature pairs with abundance ratios greater than 1 (Bickel, 2004, 2008). In addition, $CV_i^{-1}$ is often of more scientific interest than the mean since small changes in numbers of biomolecules can have a strong influence on downstream processes.

The method of Example 2 is applied to the proteomics data set of Alex Miron's lab at the Dana-Farber Cancer Institute (Li, 2009), with $x_{ij}$ and $y_{ij}$ as the logarithms of the abundance levels of the $i$th of $N = 20$ proteins in the $j$th woman with and without breast cancer, respectively, after the preprocessing of Bickel (2010b). Likewise, $\xi_i$ and $\eta_i$ are the expectation values of the random variables $X_{ij}$ and $Y_{ij}$. Each of two breast cancer groups (one of 55 HER2-positive women and the other of 35 mostly ER/PR-positive women) was compared to a control group of 64 women. Since $\theta_i = |CV_i^{-1}|$ and thus $\Theta = [0, \infty)$, the competing hypotheses for the $i$th protein are $\theta_i > 0$ and $\theta_i = 0$.

The left panel of Fig. 2 displays the approximate information for discrimination in favor of the alternative hypothesis that $\theta_i \ne 0$ over the null hypothesis that $\theta_i = 0$ by weighing the incidental proteins as a single observation (Section 2.4). $\tilde{I}_i((0, \infty), \{0\})$, the approximate optimal information, is compared to $\log\left[ L_i(\hat{\theta}_{\mathrm{MLE}}; t_i) / L_i(0; t_i) \right]$. Here, $\hat{\theta}_{\mathrm{MLE}}$ is common to all proteins, denoting the maximum likelihood estimate (MLE) defined under the assumptions that $\theta_i \in \{0, \theta_{\mathrm{alt}}\}$ for some $\theta_{\mathrm{alt}} > 0$ for all $i$ and that the test statistics are independent (Bickel, 2010b). The right panel of Fig. 2 contrasts the widely varying regret of the MLE information with the constant regret of the optimal information.



Giving the null hypothesis the weight of a single observation (11), as if the abundance level of only one protein were measured, results in information values that are visually indistinguishable from those of Fig. 2. Nonetheless, some effect of the weighting method is perceptible for much smaller sample sizes. For example, Fig. 3 displays the effect of using the null hypothesis weights instead of the protein weights on $\tilde{I}_i((0, \infty), \{0\})$ for two randomly selected patients from each breast cancer group and from the healthy group. Even in this extreme case, only one protein out of 20 in the right-side panel has a different evidence grade (Table 1) depending on how the weights are computed.

Figure 2: Left panel: Discrimination information (bits) favoring the hypothesis that the abundance level of a protein differs by disease status versus $\widehat{CV}_i^{-1}$. Right panel: The corresponding regret versus $\widehat{CV}_i^{-1}$. ($\widehat{CV}_i^{-1}$ denotes the difference in sample means divided by the sample standard deviation for the $i$th protein.)



4 Discussion 

In both of the case studies of Section 3, the use of data associated with comparisons other than the comparison currently in focus in place of an artificial data point determined by the null hypothesis has little effect on the information for discrimination. In the second application, little information would be lost for inference about a single protein were the other 19 absent, except when the sample size was reduced to $n_i = 4$. Thus, the use of single-observation weights robustly addresses the infinite-complexity issue with NML raised in Section 1.3.

The insensitivity to the use of incidental information also suggests that the NMWL solution to the incidental-information issue raised in the same section is a measure of evidence that has the same interpretation for any number of comparisons. By contrast, p-values adjusted to control error rates and, to a lesser extent, posterior probabilities from hierarchical Bayesian models tend to vary so greatly between a single comparison and a large number of comparisons that they require researchers to separately build the intuition needed to interpret statistical reports for small numbers of comparisons, medium numbers of comparisons, large numbers of comparisons, etc. This shortcoming of traditional approaches to the multiple comparisons problem is especially glaring when an article reports various degrees of adjusting p-values for data types involving very different numbers of features.

Figure 3: Discrimination information (bits) favoring the hypothesis that the abundance level of a protein differs by disease status (n = 2 women per group) using weights from the null hypothesis versus that using weights from the incidental proteins.



As seen in Section 3.1, the optimal information for discrimination can indicate strong evidence for a simple null hypothesis. While in principle the Bayes factor can also favor the null hypothesis, prior distributions commonly used in practice often can provide only weak Bayes-factor support for a simple null hypothesis that corresponds to the data-generating distribution (Johnson and Rossell, 2010). The ability of the information for discrimination to indicate whether the evidence in the data is strongly in favor of the alternative hypothesis, strongly in favor of the null hypothesis, or insufficient to strongly favor either hypothesis (Table 1) guards against the prevalent misinterpretation of a high p-value as evidence for a null hypothesis. More important, the discrimination information provides scientists a reliable tool designed to objectively answer the questions they ask of their data.